Regex support for hexadecimal and unicode escapes #1341

lmasroca · 2025-10-07T20:23:04Z

Added support for short hexadecimal escapes (\x00..\xff) and Unicode escapes (\u0000..\uffff) in Java and JavaScript regular expressions.

added Java/JS support for regular expressions with hexadecimal and unicode escape sequences
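
For reference, here is a minimal sketch of what the two escape forms mean to the JVM regex engine that the generated strings must eventually satisfy. It only uses Kotlin's standard `Regex` class; `fullyMatches` is a hypothetical helper for illustration and is not part of this PR or of EvoMaster.

```kotlin
// Hypothetical helper for illustration only; not part of the PR.
// Returns true when the whole input matches the given pattern.
fun fullyMatches(pattern: String, input: String): Boolean =
    Regex(pattern).matches(input)

fun main() {
    // "\\x41" is the regex \x41, a two-digit hexadecimal escape for 'A' (0x41).
    println(fullyMatches("\\x41", "A"))
    // "\\u0041" is the regex \u0041, a four-digit Unicode escape for the same character.
    println(fullyMatches("\\u0041", "A"))
    // \xe9 denotes U+00E9 (e with acute accent).
    println(fullyMatches("\\xe9", "\u00e9"))
    // Escapes can also be used as range endpoints inside a character class.
    println(fullyMatches("[\\u0030-\\u0039]+", "2025"))
}
```

Both forms are plain character escapes, so a gene built from such a pattern should presumably sample exactly the character the escape denotes.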

arcuri82

@lmasroca @jgaleotti thanks for this PR! But I'm a bit confused about EOF... not saying it is wrong, but I don't understand why it needed to be added, and what possible side effects it could have.

arcuri82 · 2025-10-10T10:42:46Z

core/src/main/antlr4/org/evomaster/core/parser/RegexEcma262.g4

 // Parser rules have first letter in lower-case

-pattern : disjunction;
+pattern : disjunction EOF;


Why this EOF?
How would it work when dealing with strings that don't have it?

arcuri82 · 2025-10-10T10:43:05Z

core/src/main/antlr4/org/evomaster/core/parser/RegexJava.g4

 // Parser rules have first letter in lower-case

-pattern : disjunction;
+pattern : disjunction EOF;


see previous comment

arcuri82 · 2025-10-10T10:44:07Z

core/src/main/kotlin/org/evomaster/core/parser/GeneRegexEcma262Visitor.kt


-        val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}$text")
+        // we remove the <EOF> token from end of the string to store as sourceRegex
+        val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}${text.substring(0,text.length - EOF_TOKEN.length)}")


what if the text does not have EOF?
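
One possible way to make the stripping robust to this case is sketched below. It assumes `EOF_TOKEN` holds `"<EOF>"`, the text ANTLR uses for its end-of-file token; the constant is redeclared here and `stripEofToken` is a hypothetical helper, neither taken from the PR. `removeSuffix` returns the string unchanged when the suffix is absent, whereas the `substring` call in the diff would drop the last five characters regardless, or throw when the text is shorter than the token.

```kotlin
// Assumption: same value as the EOF_TOKEN constant referenced in the diff.
const val EOF_TOKEN = "<EOF>"

// Hypothetical helper for illustration; not part of the PR.
// Removes a trailing "<EOF>" if present, otherwise returns the text as-is.
fun stripEofToken(text: String): String = text.removeSuffix(EOF_TOKEN)

fun main() {
    println(stripEofToken("a|b<EOF>")) // suffix present: removed
    println(stripEofToken("a|b"))      // suffix absent: unchanged
    println(stripEofToken("ab"))       // shorter than the token: still safe
}
```

With `pattern : disjunction EOF;` the text of the start rule should always end with the token, so this is only a defensive variant of the same idea.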

arcuri82 · 2025-10-10T10:45:38Z

core/src/main/kotlin/org/evomaster/core/parser/GeneRegexJavaVisitor.kt


-        val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}$text")
+        // we remove the <EOF> token from end of the string to store as sourceRegex
+        val gene = RegexGene("regex", disjList,"${RegexGene.JAVA_REGEX_PREFIX}${text.substring(0, text.length - EOF_TOKEN.length)}")


see previous comment

lmasroca · 2025-10-14T21:53:49Z

> @lmasroca @jgaleotti thanks for this PR! But I'm a bit confused about EOF... not saying it is wrong, but I don't understand why it needed to be added, and what possible side effects it could have.

By default, ANTLR4 tries to match as much input as possible according to the grammar rules. Without EOF in the start rule, it may stop parsing after the longest valid match and silently ignore the rest. Adding EOF forces it to consume the entire input, which helps detect leftover or invalid tokens. This was needed for the tests that intentionally feed invalid input. As for side effects, patterns containing invalid or unsupported syntax now cause an exception instead of having part of the input silently dropped. See https://github.com/antlr/antlr4/blob/master/doc/parser-rules.md#start-rules-and-eof
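
The difference can be illustrated by analogy with `java.util.regex`, where `lookingAt()` accepts a valid prefix and `matches()` requires the whole input. This is only an analogy, not ANTLR code: the `rule` pattern and both helper functions are hypothetical stand-ins for the `disjunction` rule.

```kotlin
import java.util.regex.Pattern

// Hypothetical stand-in for a grammar rule: lowercase words separated by '|'.
// This uses java.util.regex, not ANTLR.
private val rule: Pattern = Pattern.compile("[a-z]+(\\|[a-z]+)*")

// Analogous to `pattern : disjunction;`: a valid prefix is enough.
fun acceptsPrefix(input: String): Boolean = rule.matcher(input).lookingAt()

// Analogous to `pattern : disjunction EOF;`: the whole input must be consumed.
fun acceptsWhole(input: String): Boolean = rule.matcher(input).matches()

fun main() {
    val withInvalidTail = "ab|cd)"
    println(acceptsPrefix(withInvalidTail)) // true: the trailing ')' is ignored
    println(acceptsWhole(withInvalidTail))  // false: the trailing ')' is rejected
    println(acceptsWhole("ab|cd"))          // true: fully valid input
}
```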

arcuri82 · 2025-10-16T08:40:25Z

merged into #1349 to be able to run CI on it

External pr lmasroca from #1341

lmasroca and others added 12 commits August 7, 2025 16:33

Create test.txt

aafcf38

hexadecimal escapes

58e9f70

Merge branch 'WebFuzzing:master' into test

8aa402e

Merge branch 'WebFuzzing:master' into test

00c352b

Merge branch 'WebFuzzing:master' into test

5d6ed80

tests for invalid escapes

b88f1d2

added support for unicode escapes (e.g.: "\u0000", "\uaaaa", etc.)

931a4b0

small refactor

04ff031

small refactor

2305c16

hexadecimal and unicode escapes

bc2f5f6

added Java/JS support for regular expressions with hexadecimal and unicode escape sequences

Merge branch 'WebFuzzing:master' into master

ecd01c5

Merge branch 'WebFuzzing:master' into master

fd73c97

jgaleotti requested a review from arcuri82 October 7, 2025 20:25

arcuri82 reviewed Oct 10, 2025

arcuri82 changed the base branch from master to external-pr-lmasroca October 16, 2025 08:37

arcuri82 merged commit ab428c8 into WebFuzzing:external-pr-lmasroca Oct 16, 2025

arcuri82 added a commit that referenced this pull request Oct 17, 2025

Merge pull request #1349 from WebFuzzing/external-pr-lmasroca

473d02c

External pr lmasroca from #1341
